Blocking scrapers in the nftables firewall
When you self-host and expose a git server to the internet, you'll find your access log filled with scrapers. Hence I've had the following in my git nginx config to ask the bots to kindly fuck off. This stops a lot of bots, at least the ones that respect it:
location /robots.txt {
return 200 "User-agent: * # match all bots
Disallow: / # keep them out";
}
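As a quick sanity check, here is a minimal Python sketch that feeds the same two robots.txt lines to the standard library's urllib.robotparser and asks whether a well-behaved bot may fetch a repo page. The bot name ExampleBot and the helper is_allowed are made up for illustration. It only shows what a polite crawler should conclude, not what scrapers actually do.

```python
from urllib.robotparser import RobotFileParser

# The same two lines the nginx location block returns, comments included.
ROBOTS_TXT = """User-agent: * # match all bots
Disallow: / # keep them out"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a hypothetical crawler name.
    print(is_allowed("ExampleBot", "https://git.node5.net/firmware/qmk/"))
```

A crawler that honors robots.txt gets a "no" for every path, which is why only the rude ones keep showing up in the log.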
However, after reading the blog post Stop Scraping my Git Forge! - notashelf.dev, I thought: let's take another look. And would you look at that, lots of entries like the following:
47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /firmware/sonix-qmk/diff/keyboards/qwertyydox/keymaps/default?id=e7cc5a35c2b80d081207db940777b7537d30a5cd&id2=9808bfaf2616afbe837873d962bc214be3705f90 HTTP/1.1" 403 186 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Mobile Safari/537.36"
101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e HTTP/2.0" 403 175 "https://git.node5.net/firmware/qmk/commit/keyboards/handwired/k_numpad17/config.h?id=1eb70be4579e3888ea665fec5706b03eac3d2b3e" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
$ whois 101.44.71.209 | grep netname
OrgName: Huawei-Cloud-HK
$ whois 47.79.213.166 | grep Organization
OrgName: Alibaba Cloud LLC (AL-3)
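To decide which addresses are worth a whois lookup, it helps to tally requests per client first. The following is a minimal Python sketch, assuming the log uses nginx's default "combined" format, where the client address is the first whitespace-separated field. The function name top_clients and the shortened sample lines are illustrative only.

```python
from collections import Counter


def top_clients(log_lines, n=10):
    """Return the n most frequent client addresses as (address, count) pairs."""
    counts = Counter()
    for line in log_lines:
        # In the "combined" log format the client address comes first.
        fields = line.split(maxsplit=1)
        if fields:
            counts[fields[0]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Shortened, made-up request lines using the two addresses seen above.
    sample = [
        '47.79.213.166 - - [03/Aug/2025:02:12:22 +0200] "GET /a HTTP/1.1" 403 186 "-" "-"',
        '101.44.71.209 - - [03/Aug/2025:02:12:26 +0200] "GET /b HTTP/2.0" 403 175 "-" "-"',
        '47.79.213.166 - - [03/Aug/2025:02:12:30 +0200] "GET /c HTTP/1.1" 403 186 "-" "-"',
    ]
    for address, hits in top_clients(sample):
        print(address, hits)
```

On a real server you would pass an open file object for the access log instead of the sample list.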
If you look up the IPs on bgp.he.net, you can find all the associated IP prefixes. Copy the text of that page to a text file and grep it with this pattern (source):
grep -E -o "([0-9]{1,3}[\.]){3}[0-9]{1,3}.*$"
This gives you all the IPv4 ranges.
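If you would rather validate and deduplicate the prefixes while extracting them, the same idea can be sketched in Python. This is a rough sketch, not the method from the linked source: it assumes the saved page text contains prefixes in a.b.c.d/len form, and the name extract_prefixes is made up here.

```python
import ipaddress
import re

# Dotted quad followed by a prefix length, e.g. 1.178.32.0/20.
CIDR_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}/\d{1,2}\b")


def extract_prefixes(text):
    """Return the unique, valid IPv4 networks found in text, sorted."""
    found = set()
    for candidate in CIDR_RE.findall(text):
        try:
            # strict=False normalizes a prefix whose host bits are set.
            found.add(ipaddress.ip_network(candidate, strict=False))
        except ValueError:
            # Looks like a prefix but is not one (e.g. an octet above 255).
            continue
    return sorted(found)


if __name__ == "__main__":
    page_text = "1.178.48.0/20 HUAWEI CLOUDS\n1.178.32.0/20 HUAWEI CLOUDS\n"
    for network in extract_prefixes(page_text):
        print(network)
```

Unlike the grep pattern, this drops the description after each prefix, so keep the AS name around separately if you want it as a comment.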
You can import a file, e.g. inside the server block, with:
include /etc/nginx/sites-available/blocklist.conf;
blocklist.conf:
# AS136907 HUAWEI CLOUDS
deny 1.178.32.0/20;
deny 1.178.48.0/20;
...
This, however, will still fill your access logs, since nginx still answers every one of these requests with a 403.
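Writing the deny lines by hand gets old quickly. Here is a small Python sketch that renders them from a mapping of comment to prefixes. The function name render_nginx_blocklist and the input shape are assumptions for illustration. Only the two prefixes shown above are used.

```python
import ipaddress


def render_nginx_blocklist(groups):
    """Render {comment: [prefixes]} as nginx "deny" directives."""
    lines = []
    for comment, prefixes in groups.items():
        lines.append(f"# {comment}")
        for prefix in prefixes:
            # ip_network() raises ValueError for anything that is not a valid prefix,
            # so a typo fails here instead of breaking the nginx config.
            network = ipaddress.ip_network(prefix)
            lines.append(f"deny {network};")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_nginx_blocklist(
        {"AS136907 HUAWEI CLOUDS": ["1.178.32.0/20", "1.178.48.0/20"]}
    ), end="")
```

Redirect the output to blocklist.conf and run nginx -t before reloading.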
Even better, you can block these IPs entirely with nftables.
In /etc/nftables.conf, add the following (source):
include "nftables_blocklist.conf"
table inet filter {
    set blocklist {
        type ipv4_addr; flags interval;
        auto-merge
        elements = $blocklist
    }
    chain input_world {
        ip saddr @blocklist counter drop
        ...
nftables_blocklist.conf:
define blocklist = {
    1.178.32.0/20, # AS136907 HUAWEI CLOUDS
    1.178.48.0/20,
    ...
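The define block can be generated from a list of prefixes as well. The sketch below is one possible way, with a made-up function name render_nft_define. It collapses adjacent ranges with ipaddress.collapse_addresses, which is optional since the set already uses auto-merge, and it drops the per-AS comments. The 192.0.2.0/24 entry is a documentation range used purely as a placeholder.

```python
import ipaddress


def render_nft_define(prefixes, name="blocklist"):
    """Render prefixes as an nftables "define NAME = { ... }" block."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    if not networks:
        raise ValueError("refusing to render an empty blocklist")
    # Merge adjacent or overlapping ranges into the fewest possible prefixes.
    collapsed = ipaddress.collapse_addresses(networks)
    body = ",\n".join(f"    {network}" for network in collapsed)
    return f"define {name} = {{\n{body}\n}}\n"


if __name__ == "__main__":
    print(render_nft_define(
        ["1.178.32.0/20", "1.178.48.0/20", "192.0.2.0/24"]
    ), end="")
```

The two /20 ranges from above are neighbors, so they come out as a single 1.178.32.0/19 entry. Check the generated file with nft -c -f /etc/nftables.conf before loading it.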
To check that the rule is matching, list the ruleset and look at the counter:
$ sudo nft list ruleset | grep '@blocklist'
ip saddr @blocklist counter packets 29 bytes 1732 drop
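If you want to track that counter over time, e.g. from a cron job, the line is easy to parse. This is a minimal Python sketch that assumes the rule is printed in the form shown above. The function name blocklist_counters is hypothetical.

```python
import re

COUNTER_RE = re.compile(r"counter packets (\d+) bytes (\d+)")


def blocklist_counters(ruleset_text):
    """Return (packets, bytes) of the first rule matching @blocklist, or None."""
    for line in ruleset_text.splitlines():
        if "@blocklist" not in line:
            continue
        match = COUNTER_RE.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


if __name__ == "__main__":
    sample = "ip saddr @blocklist counter packets 29 bytes 1732 drop"
    print(blocklist_counters(sample))
```

Feed it the output of sudo nft list ruleset to get the current numbers.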
On a side note, I think LLM companies are scraping, or are going to scrape, git repos heavily, since a good git commit basically works as a recipe for completing an isolated task. That only pays off as long as they're able to rank the quality of the input data, since a model is only as good as its input and a lot of that data is noise.